home *** CD-ROM | disk | FTP | other *** search
- Sprite on Mach
-
- Mike Kupfer
- University of California, Berkeley
-
- Abstract
-
- Sprite is a distributed operating system that supports a fast
- single-image network file system and transparent process migration.
- Over a period of 19 months I ported Sprite to run as a server on top
- of the Mach 3.0 microkernel. Although the resulting server does not
- implement some Sprite features, it can run in an existing Sprite
- cluster, and it supports standard UNIX programs like vi, gcc, and
- make.
-
- Porting Sprite to Mach was generally straightforward, though there
- were some difficulties. Many of the problems were related to
- asynchronous interactions between the Sprite server, Mach, and Sprite
- user processes. Others resulted from trying to maintain native
- Sprite's internal interfaces in the Sprite server.
-
- Although machine-dependent code accounts for 14% of the source lines
- in the Sprite kernel, it accounts for only 1% of the source lines in
- the Sprite server. I believe that this will significantly simplify
- porting Sprite to new hardware platforms. Unfortunately, the Sprite
- server runs the Andrew benchmark at only 38% of the speed of native
- Sprite. None of the performance problems appear insurmountable, but
- they could require a long time to track down and fix.
-
-
- 1. Introduction
-
- Sprite is a distributed operating system that was developed at the
- University of California, Berkeley [1]. It features a fast
- single-image network file system [2], transparent process migration
- [3], and a high-performance log-structured local filesystem [4].
- Sprite was originally written to support the SPUR multiprocessor
- project at Berkeley [5], so the kernel is multi-threaded and uses
- fine-grain locking.
-
- I have ported the Sprite kernel to run as a server on top of Mach
- 3.0. The server does not have complete Sprite functionality, but it
- can run most UNIX commands as a client of native Sprite file servers.
- I used the modified Andrew benchmark [8] to partially tune the server
- and to analyze the remaining performance problems.
-
- In the following sections I will explain why I ported Sprite to Mach;
- I will sketch the design of the Sprite server, with particular
- attention to some of the problem areas; I will discuss the relative
- sizes and portability of native Sprite and the Sprite server; I will
- analyze the performance of the Sprite server; I will evaluate the
- success of the Sprite server; and I will present some conclusions that
- I have drawn.
-
-
- 2. Why mix Sprite and Mach?
-
- The Sprite project became interested in Mach for three reasons: to
- make Sprite more portable, to make Sprite easier to distribute to
- external sites, and to verify whether Mach is a suitable platform for
- building distributed systems.
- % Sprite currently
- % runs on the DECStation 5000 and on Sun SPARC systems. In the past it
- % has also run on the DECStation 3100, the Sun 2, the Sun 3, prototype
- % SPUR systems, and the Sequent Symmetry. Most of these ports were
- % performed by students in the Sprite group. These ports were
- % time-consuming and delayed progress on the students' research work.
- % It was hoped that by putting Sprite on top of Mach, the Sprite group
- % could leave the hardware ports to the Mach community, freeing the
- % Sprite group to concentrate on more interesting research projects.
-
- (The full paper will elaborate on these reasons. The portability and
- software distribution issues are a result of the small size of the
- Sprite group. The discussion about Mach being a suitable platform
- will mention the UX server and the shared memory server as being the
- only relevant prior art for a highly-networked system like Sprite. I
- will claim that Sprite stresses Mach more than either of these
- servers, and we in the Sprite group were concerned about possible
- performance problems in the network code.)
-
- These project goals led to the following design goals. First, the
- revised Sprite system should be highly portable. This led to
- implementing Sprite as a server, under the assumption that
- machine-dependent code such as device drivers and memory management
- would be written by the Mach community, not the Sprite group. A
- second goal was simplicity, so as not to complicate future Sprite
- research work. A third goal was to minimize changes to existing
- Sprite code. The point of this goal was to minimize development time
- and to simplify synchronizing native Sprite code changes with the
- Sprite server. It led to a single-server, rather than a multi-server,
- approach. The fourth goal was to get performance comparable to native
- Sprite. Slight performance degradation was considered acceptable
- in return for improved portability.
-
-
- 3. Design of the Sprite server
-
- As a first approximation, Mach can be thought of as a high-level
- abstract machine. It provides processes, scheduling, a simple
- interface for accessing a process's memory, and a machine-independent
- interface for accessing local devices such as disks or networks. The
- C Threads library provides threads, locks, and condition variables.
- Thus the first step in porting Sprite was to rip out all the low-level
- native code whose functionality was already provided by Mach,
- replacing it with a facade that maintains Sprite's internal
- interfaces. This process is similar to the transformation that was
- used to implement the ``UX'' UNIX server [6, 7].
-
- On the user side, emulation code was added to the C runtime library.
- This code translates Sprite system calls into Matchmaker RPC requests
- to the Sprite server. Again, this setup is similar to the one used in
- the UNIX server.
-
- One important difference between the UNIX and Sprite servers is how
- they use external pagers. The UNIX server provides an external pager
- for mapped files, including program text, but it uses the Mach default
- pager for swap storage (i.e., a process's heap and stack). The Sprite
- server provides an external pager that backs the entire address space
- of a user process, including its heap and stack. This design lets a
- process page from a network file server, as is done in native Sprite.
- Sprite uses this feature to support diskless operation - almost all
- Sprite workstations are diskless - and to support process migration.
-
- Most of the changes to port Sprite to Mach were straightforward.
- Nonetheless, some changes presented unexpected difficulties. Many of
- the problems were related in some way to asynchronous interactions
- between the Sprite server, Sprite user processes, and Mach. For
- example, some data structures, such as memory objects and process
- table entries, are logically shared between the Sprite server and
- Mach, and special care is needed to ensure that the data structures
- are not freed prematurely or leaked. (The full paper will elaborate
- on one or both of these examples, depending on space availability,
- with pictures. If there is space, it will also discuss a race in the
- implementation of sbrk() and complications to handling copyin errors.)
- No single problem was insurmountable, but each required extra time to
- get the details of the design and implementation correct.
-
- % For example, consider what happens when a process maps a file and then
- % exits. The Sprite server cannot free its internal representation of
- % the mapped file until it receives a memory_object_terminate
- % notification from the Mach kernel. Now consider what happens if a new
- % user process tries to map the same file before the previous mapping
- % has been cleaned up. Should the server create a new object to
- % represent the mapping? Should it wait for the old one to be
- % destroyed?
-
- A similar problem was how to invoke a user signal handler for signals
- that are generated asynchronously. (The full paper will explain this
- in more detail. The current implementation does not support
- asynchronous signal delivery when there is a user handler for the
- signal. Instead the server passes back a ``pending signal'' flag for
- almost all requests. If this flag is set, the emulation library gets
- the signal information and invokes the handler.)
-
- % In native Sprite the kernel can
- % simply wait until the user process returns from a time slice interrupt
- % or system call. Strictly speaking, this is not an option for the
- % Sprite server because the process could be in an infinite loop waiting
- % for a signal. Instead the server can freeze the process, change its
- % registers and stack to invoke the signal handler, and resume the
- % process. Unfortunately, if the user process is making a Mach system
- % call, signal delivery should be delayed until the system call is
- % complete. Because of the potential complexity and performance
- % problems associated with a general solution (e.g., looking at the
- % process's program counter), we decided not to attempt asynchronous
- % signal delivery when there is a signal handler. Instead Sprite calls
- % return a flag to the emulation library saying that there is a pending
- % signal, and the emulation library makes a special call to invoke the
- % signal handler. We expect this design to be acceptable in practice.
-
- Maintaining Sprite's internal process interface was also complicated
- in places. Native Sprite is like UNIX in that user processes run in
- ``kernel mode'' after an interrupt or after executing a trap. As a
- consequence, native Sprite was designed so that a process can kill or
- suspend only itself. An attempt to kill or suspend a different
- process is simply a request, which the target process eventually acts
- on. In contrast, the Sprite server is a separate Mach task that user
- processes invoke via RPCs. The server can simulate the effects of
- traps and interrupts through an internal pool of threads, but there
- are some problems with this approach. First, special care is needed,
- e.g., in exit() and exec(), to ensure that the threads correctly
- manage internal resources like buffers. Second, because the Sprite
- server does not handle time slice interrupts, it cannot guarantee that
- user processes will notice attempts to kill or suspend them. Thus,
- unlike in native Sprite, user processes must be able to kill or
- suspend other user processes. This requirement led to many changes in
- the locking strategy employed by the process management and signals
- code in Sprite, and these changes were the source of many bugs.
- % Deadlocks in the exit() code were common during development and
- % testing.
- % Special code to handle suspension is needed in the system request
- % code, in case a process is suspended after it has submitted a
- % request but before the request has been acted on. Also, native
- % Sprite integrates suspension with the sched/sync code; the server
- % requires that ugly wait loop.
-
-
- 4. Status and code measurements
-
- This section explains the current status of the Sprite server and
- presents some code size measurements.
-
- I developed the Sprite server over a period of 19 months. The first 7
- months of the project were a half-time effort; the remaining 12 months
- were full-time, including 2.5 months of performance tuning. I began
- development on a Sun 3 but later moved to a DECStation 5000.
-
- % Moved to ds5000 for performance reasons and to avoid having to deal
- % with signals support for 2 architectures.
-
- The Sprite server supports standard UNIX programs like vi, gcc, and
- make. Nonetheless, the implementation is still a prototype, and it
- lacks some important features. The server design includes support for
- these features; I simply lacked the time to implement them. The
- missing features include
-
- - binary compatibility (with both the vendor operating system and
- native Sprite)
- - local disk access
- - support for debugging user processes
- - process migration
-
- A Sprite DECStation kernel with the same functionality as the Sprite
- server contains roughly 143,000 lines of code, of which 27,000 lines
- are machine-dependent, including 3600 lines of assembly code. The
- Sprite server contains 111,000 lines of code, of which 1300 lines are
- machine-dependent. The server contains 4 lines of assembly code to
- help debug locking errors.
-
- In going from native Sprite to the Sprite server, I preserved about
- 90,000 lines of code (63%). I threw away another 39,000 lines of
- code (27%), and I rewrote the remaining code, about 14,000 lines
- (10%), for use with Mach. The thrown away code consisted of low-level
- (and often machine-dependent) code for device, process, and memory
- management, plus code that duplicated Mach functionality (e.g., the
- process scheduler and a kernel debugger). Most of the rewritten code
- was for process and memory management, ``system call'' support, and
- Sprite's internal lock package.
-
- To support the missing functionality mentioned earlier in this
- section, I would have to port an additional 58,000 lines of code, of
- which 2400 lines are machine-dependent. I expect that about 52,000
- lines of this code would not require change, another 2000 lines would
- get thrown away, and the remaining 4000 lines would get rewritten.
-
- (The full paper will use bar charts to present some or all of these
- numbers.)
-
-
- 5. Performance
-
- Due to time constraints, I used only one benchmark to tune the Sprite
- server: the modified Andrew benchmark [8]. This benchmark copies a
- file tree, scans the tree several times, and then compiles some window
- system code. Unless otherwise noted, all the performance numbers in
- this section were obtained on a DECStation 5000 with at least 32
- megabytes of memory, using files stored on a native Sprite file
- server.
-
- As (will be) shown in the table, native Sprite runs the benchmark in
- 91 seconds. The UNIX server (UX34) and Ultrix 4.2, using only
- local files, need around 118 seconds. The time for Ultrix 4.2 goes up
- to 141 seconds when the files are on a Prestoserve-equipped NFS
- server. The Sprite server's fastest time for the benchmark is 237
- seconds.
-
- Early versions of the Sprite server required around 440 seconds to run
- the benchmark. Most of the performance fixes to reach the current
- time were in three areas: caching of text segments, reduced Sprite RPC
- latency, and more efficient string management in exec(). (The full
- paper will explain these in more detail and tell how much they
- contributed to the overall improvement.)
-
- I have analyzed the 146 second gap between native Sprite and the
- Sprite server, and I can explain about 98 seconds of it. I know
- more or less where the remaining 48 seconds are being spent, but not
- why. Tuning stopped when I changed employers in July 1992.
-
- The paper will explain the current top 5 understood or partially
- understood bottlenecks. They are
-
- (1) overhead in fork() from copying the heap and stack, which accounts
- for 45 seconds. This problem is present because Mach does not
- currently support copy-on-write for externally managed memory objects.
-
- (2) copyin/copyout delays for read() and write(), which account for 18
- seconds. The UNIX server uses mapped files to avoid this problem. A
- similar approach for Sprite is not straightforward, because Sprite
- doesn't support consistent access to mapped files that are shared by
- multiple clients.
-
- (3) higher Sprite RPC latency (despite the improvements made during
- tuning), which accounts for 11 seconds. I suspect that most of the
- time is spent in the Mach networking code, though I was unable to
- verify this for the network input path.
-
- (4) overhead after exec() from faulting in heap and stack pages, which
- accounts for 8 seconds. The Sprite server currently creates new heap
- and stack segments in exec(); I expect that a better approach is to
- zero out and reuse the old segments.
-
- (5) unexplained Sprite ``system call'' time, which accounts for 8
- seconds. I suspect that this problem results from unnecessary
- context switching, and that it can be fixed by changing how the Sprite
- server garbage collects process table entries.
-
- % Put a summary (a short, coherent summary) of the 5 items here?
-
- % Of the understood bottlenecks, the largest is in fork(). Although the
- % Mach external pager interface supports copy-on-write copying of memory
- % objects, the implementation is broken, so the benchmark spends 45
- % seconds just copying heap and stack pages. The UNIX server avoids
- % this problem by using the default pager for user heaps and stacks, but
- % as explained earlier, this is not an option for Sprite. Fortunately,
- % there are plans to fix copy-on-write for externally managed memory
- % objects [JSB private comm.].
- %
- % The next largest bottleneck (18 seconds) is spent copying bytes to and
- % from user space, mostly for read() and write(). Our measurements
- % indicate that the real culprit is the cost of the cross-address-space
- % memory operations, rather than the bcopy cost just to move the bytes.
- % The UNIX server uses mapped files to avoid this bottleneck [ref UX
- % paper and Dean/Armand paper].
- % Unfortunately, mapped files in Sprite do not enjoy the same network
- % consistency guarantees that files accessed through a read/write
- % interface enjoy. This is not an inherent flaw in Sprite's design, but
- % fixing it would require changes to native Sprite as well as to the
- % Sprite server. It may be worth exploring other alternatives (e.g.,
- % having the server cache mappings into user processes) before resorting
- % to mapped files.
- %
- % Another large slowdown in the system is caused by an increase in RPC
- % latency. Native Sprite can do a null RPC between two DECstation 5000s
- % in 0.85 milliseconds. The null RPC time between the Sprite server and
- % a native Sprite system is 2.2 milliseconds. For the Andrew benchmark,
- % which causes about 8200 Sprite RPCs, this translates to a
- % slowdown of about 11 seconds. It's not clear what the right solution
- % is for this problem. (Adopt NORMA RPC? (See 1992 Mach Usenix paper;
- % is it applicable?) Replace the packet filter architecture with the
- % OSF proposal? Use the X-kernel?)
- %
- % Another slowdown is caused by unnecessary faults during exec(). The
- % Sprite server currently destroys and recreates the heap and stack
- % segments during exec(). The Andrew benchmark causes about
- % 8300 heap and stack faults; at 1 millisecond per fault (where does
- % this number come from??), this causes a delay of about 8 seconds. To
- % fix this problem, the Sprite server can reuse heap and stack segments.
- % Unfortunately, it will return once the Sprite server starts using
- % copy-on-write for fork().
- %
- % Another large bottleneck is in the code that processes Sprite "system
- % calls" (really MIG calls) from Sprite user processes. After comparing
- % the UNIX server "system call" times with the Sprite server times, we
- % estimate that this bottleneck is adding 8 seconds to the Andrew
- % benchmark time. We think that this is due to excessive context
- % switches in that part of the Sprite server, but this hypothesis has
- % not yet been verified.
-
-
- 6. Evaluation
-
- This section will evaluate the Sprite server according to the design
- goals presented earlier in the paper: portability, simplicity,
- minimizing changes to native Sprite, and performance.
-
- The Sprite server appears to be highly portable. It contains less
- code than an equivalent Sprite kernel, plus there is very little
- machine-dependent code and essentially no assembly code. The
- unimplemented code from native Sprite is mostly machine-independent,
- so a fully functional Sprite server should still be very portable.
-
- Although the Sprite server is not as simple as I'd like (because of
- the design complications described in section 3), most of the Sprite
- server design and implementation were straightforward, and debugging
- the server with gdb was generally easy. In fact, I was able to
- track down some bugs from native Sprite that had long been puzzling
- the Sprite group.
-
- Porting Sprite to Mach involved an acceptably small number of changes
- to native Sprite. Most of the server code is identical to the kernel
- code from which it is derived, and I made few changes to Sprite's
- internal interfaces. Furthermore, the Sprite server runs in an
- existing Sprite cluster. No changes - other than adding a new machine
- type - were made to the native Sprite systems to accommodate the
- Sprite server.
-
- Unfortunately, I was not able to meet my performance goal. Assuming
- that I can apply the fixes that I know of or expect to see soon, I am
- still left with a performance of about 165 seconds for the Andrew
- benchmark, which is still worse than Ultrix using NFS. I believe that
- additional tuning could reduce the benchmark time even further, though
- this would probably be a time-consuming task.
-
-
- 7. Future work
-
- As mentioned earlier in the paper, I am no longer employed by the
- University of California, and it is unlikely that anyone will do
- additional work on the Sprite server. If development were to
- continue, the obvious first task would be to continue performance
- tuning. More work is needed to make the Sprite server perform
- adequately on the Andrew benchmark, and there are other benchmarks
- that the server should be tested on. Furthermore, a
- production-quality Sprite server would require the remaining
- functionality mentioned in section 4.
- % mention the WPI synthetic benchmarks [ref]? It would be nice
- % if you could find a different reference? (Reread the Mach workshop
- % paper; maybe it's better than you remember.)
-
- In addition to mundane development work, it would be useful to conduct
- additional research using the Sprite server. For example, it would be
- interesting to compare the performance of the Sprite server with the
- performance of a similar server that uses Mach IPC for access to
- remote devices or process migration.
- % Look at the FS/VM cache boundary stuff?
-
-
- 8. Conclusions
-
- Porting Sprite to Mach was a mixed success. On the one hand, it
- greatly reduced the amount of machine-dependent code in Sprite, which
- should make Sprite much easier to port to new hardware. The
- asynchronous interfaces provided by Mach require some unpleasant
- complexity in the Sprite server, but this complexity is manageable.
- On the other hand, the server is almost unusably slow, and it appears
- that much work is still needed to bring performance within striking
- distance of the native system.
-
- At least one third of the performance gap results from the distributed
- nature of Sprite. However, the slowdown is not primarily due to RPC
- latency or throughput problems. Rather, it is due to the Sprite
- server's heavy use of an external pager, plus problems such as its
- inability to use mapped files to avoid copy overhead. The lesson here
- seems to be that there is more to high-performance distributed systems
- than fast communication, and Mach 3.0 currently does not support some
- designs as well as others. For some problems (e.g., copy-on-write for
- external pagers) it should be possible to fix Mach, but in other cases
- (e.g., use of mapped files for performance), it may be necessary to
- redesign the server instead.
-
- Thus, if I ask whether it is worth porting an existing operating
- system to run on top of Mach, the answer is ``maybe.'' For a research
- system like Sprite, with its small development community, increased
- portability seems attractive enough to warrant a large one-time
- porting and tuning effort. For commercial systems, though, a more
- portable server seems less alluring, particularly if the vendor has to
- port Mach as well. I think that commercial vendors will have to
- consider the potential benefits of Mach other than portability before
- deciding to base their systems on it.
-
-
- References
-
- [1] John Ousterhout et al, ``The Sprite Network Operating
- System'', IEEE Computer, February 1988.
-
- [2] Brent Welch, ``Naming, State Management, and User-Level
- Extensions in the Sprite Distributed File System'', University
- of California, Berkeley, Technical Report UCB/CSD 90/567,
- February 1990.
-
- [3] Fred Douglis and John Ousterhout, ``Transparent Process
- Migration: Design Alternatives and the Sprite
- Implementation'', Software--Practice & Experience, July 1991.
-
- [4] Mendel Rosenblum and John Ousterhout, ``The Design and
- Implementation of a Log-Structured File System'', Proc.
- Symposium on Operating Systems Principles, October 1991.
-
- [5] M. D. Hill et al, ``Design Decisions in SPUR'', IEEE Computer,
- November 1986.
-
- [6] David Golub et al, ``UNIX as an Application Program'', Proc.
- Summer 1990 USENIX Conference, June 1990.
-
- [7] David L. Black et al, ``Microkernel Operating System
- Architecture and Mach'', Proc. USENIX Microkernel Workshop,
- April 1992.
-
- [8] John Ousterhout, ``Why Aren't Operating Systems Getting Faster
- As Fast as Hardware?'', Proc. Summer 1990 USENIX Conference,
- June 1990.
-
- % [9] Ken Shirriff and John Ousterhout, ``A Trace-driven Analysis of
- % Name and Attribute Caching in a Distributed File System'',
- % Proc. Winter 1992 USENIX Conference, January 1992.
-